We report on the progress and status of the APEmille project: a SIMD parallel computer with a peak
performance in the TeraFlops range which is now in an advanced development phase. We discuss the hardware and
software architecture, and present some performance estimates for Lattice Gauge Theory (LGT) applications.



1. Introduction 

The APE project is one of a small number of
physicist-driven efforts in Europe, Japan and
the US aiming at the development of number
crunchers optimised for LGT applications.
APEmille, the latest generation of the APE
family, is an evolution of the APE100 architecture
and has a peak performance roughly one order of
magnitude higher.

APEmille is a versatile massively parallel
processor optimised for heavy-duty numerical
applications and scalable from 4 GigaFlops to 1
TeraFlops. The main architectural enhancements
over the previous generation are the availability
of local addressing on each node, the support of
floating-point (FP) arithmetic in double
precision and for complex data types, and the
realisation of a multiple SIMD architecture. The
latter allows the machine to be divided into
independent partitions running different programs,
and a single program can also activate multiple
SIMD threads. These enhancements make the machine
potentially useful in a broader class of
applications that require massive numerical
computations and enjoy some degree of locality,
e.g. complex systems and fluid dynamics.

In this paper, we describe some details of the
APEmille architecture and the status of its
hardware implementation. We also discuss the main
software elements and performance aspects.

2. Architecture and Hardware 

APEmille is organised as a three-dimensional 
array of nodes, each consisting of an arithmetic 
processor and its local data memory. Each group 
of eight nodes is controlled by a control processor 
and operates in SIMD mode with local address- 
ing: the nodes synchronously execute the same 
instructions and may access their memories using 
different locally calculated addresses. The small- 
est machine partition that can run a program in- 
dependently has eight nodes. Larger SIMD parti- 
tions are obtained by running an identical instruc- 
tion stream on the corresponding control proces- 
sors. The arithmetic processors can directly ac- 
cess the data memories of remote nodes by means 
of a flexible communication network. 

The hardware building-block of APEmille is a 
processing board (PB). It houses eight nodes (log- 
ically arranged at the corners of a cube), one con- 
trol processor, and a communication controller. 
In addition, each PB has its own PCI-based host
interface. A large APEmille machine is assembled
by replicating these PBs (connected by
communication lines and global control signals).
The replication of the control processor and host
interface on each board not only allows simple
software configuration of the machine into
multiple SIMD partitions, but also provides a
highly simplified and modular system-level
hardware with improved reliability of the global
system.

The control processor (called "Tarzan") has
its own memory for data (512 kByte SRAM)
and program (512 k instructions DRAM). Tarzan
manages the program flow (branches, loops,
subroutine and function calls etc.) and generates
the global memory address for the arithmetic
processors. Tarzan contains two separate units
with short pipelines for integer arithmetic and
address generation. They use a large register
file (256 words of 32 bits) and provide special
instructions, e.g. for efficient loop control
(increment, comparison and conditional branch)
within a single clock cycle.

The arithmetic processor (called "Jane")
supports several data types: floating-point normal
operations (a × b + c) are implemented for
32- and 64-bit IEEE format, as well as for
single-precision complex and vector (pairs of
32-bit) operands. Arithmetic and bitwise
operations are available for integer data types.
Operands can be converted between the various
formats. The combined double pipeline of Jane
allows one arithmetic operation to be started in
every clock cycle (with a pipeline length of 9, as
in APE100, or less). This corresponds to two,
four, or eight Flop per cycle for normal
operations with simple (SNORM or DNORM), vector
(VNORM) or complex (CNORM) operands, respectively.
The key element for an efficient filling of the
arithmetic pipeline is the large register file
with 512 words of 32 bits, which avoids the need
for an intermediate cache between memory and
registers.
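
The Flop counts quoted above can be reproduced by counting the real operations in one normal operation a × b + c. The following Python sketch is purely illustrative: the function name is hypothetical, and treating a vector operand as two independent 32-bit values is an assumption based on the description "pairs of 32-bit".

```python
def flops_per_normal_op(kind):
    """Count real floating-point operations in one normal operation a*b + c."""
    real_mul_add = 2  # one real multiply plus one real add
    if kind in ("SNORM", "DNORM"):
        return real_mul_add
    if kind == "VNORM":
        # Assumed: two independent 32-bit multiply-adds per operation.
        return 2 * real_mul_add
    if kind == "CNORM":
        # Complex multiply: 4 real multiplies and 2 real adds;
        # complex add of c: 2 further real adds.
        return 4 + 2 + 2
    raise ValueError(f"unknown operation type: {kind}")


counts = {k: flops_per_normal_op(k) for k in ("SNORM", "DNORM", "VNORM", "CNORM")}
```

With one operation started per clock cycle, these counts are also the Flop per cycle for each operand type.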

Jane manages its own local data memory, a
synchronous DRAM with 4 Mwords of 64 bits.
The local address generation unit, which is
independent of the arithmetic operations of Jane,
combines the global address received from Tarzan
with a local address offset on the node. The
memory is organised in two interleaved banks to
allow page-fault-free access for data bursts up to
the size of the entire register file. The
bandwidth for the transfer between memory and
registers is one (two) operands of 64 (32) bits
per clock cycle, and start-up latencies between
subsequent bursts can be reduced to a few cycles
by partial overlapping.
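
As a rough illustration of SIMD execution with local addressing, the Python sketch below lets eight nodes execute the same load while each node adds its own offset to the global address. The function name, the memory contents, and the choice of simple addition as the way the two addresses are "combined" are assumptions made for this example, not details of the Jane hardware.

```python
NODES_PER_GROUP = 8  # eight nodes share one control processor


def simd_load(memories, global_address, local_offsets):
    """Same instruction on every node, node-specific effective address.

    Assumed here: effective address = global address + local offset.
    """
    return [mem[global_address + off] for mem, off in zip(memories, local_offsets)]


# Hypothetical memory contents: node n holds the values 100*n + i.
memories = [[100 * node + i for i in range(16)] for node in range(NODES_PER_GROUP)]
local_offsets = [0, 1, 2, 3, 0, 1, 2, 3]

values = simd_load(memories, 4, local_offsets)
```

Setting all local offsets to zero recovers the purely global addressing in which every node reads the same location of its own memory.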

Memory access to remote nodes is automatically
handled by the communication network. It
provides homogeneous communications over
arbitrary distances, i.e. all nodes can
simultaneously access data from a corresponding
remote node at a given relative distance. The
communication network automatically exploits
multiple data paths for communications along more
than one coordinate axis, and it allows broadcast
over the full machine, as well as along lines and
planes of nodes. "Soft" or inhomogeneous
communications, where the relative distance to the
remote node may be different on each node, are
foreseen as a future extension.
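
A homogeneous communication step can be pictured as every node fetching data from the node at the same relative displacement. The Python sketch below models this on a small three-dimensional grid. The function name is hypothetical, and the periodic wrap-around at the grid boundary is an assumption of the sketch, not something stated in the text.

```python
from itertools import product


def homogeneous_fetch(data, shape, displacement):
    """Each node reads the value held by the node at the given relative displacement.

    Periodic wrap-around at the boundary is assumed for illustration.
    """
    fetched = {}
    for node in product(*(range(n) for n in shape)):
        remote = tuple((c + d) % n for c, d, n in zip(node, displacement, shape))
        fetched[node] = data[remote]
    return fetched


shape = (2, 2, 2)  # the eight nodes of one processing board
data = {node: node for node in product(range(2), repeat=3)}  # each node stores its own coordinates
fetched = homogeneous_fetch(data, shape, (1, 0, 1))
```

An inhomogeneous ("soft") communication would correspond to a displacement that differs from node to node.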

Tarzan, Jane, and the communication controller
are built as three custom-designed ASICs with very
low power consumption. The APEmille chips will
probably be built with two successive
technologies, increasing the clock frequency. The
first APEmille systems will operate at 66 MHz,
providing about 500 MFlops per processor.
Upgraded to 100 MHz, the full machine with 2048
nodes would then deliver a peak performance of
1.6 TeraFlops.
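
These peak figures follow from the clock frequency, the 8 Flop per cycle of a complex normal operation, and the node count. The short Python sketch below repeats that arithmetic; the function and variable names are illustrative only.

```python
FLOP_PER_CYCLE_CNORM = 8  # one complex normal operation started per clock cycle


def peak_flops(clock_hz, nodes=1):
    """Peak Flop rate when every cycle starts a complex normal operation."""
    return clock_hz * FLOP_PER_CYCLE_CNORM * nodes


per_node_66mhz = peak_flops(66_000_000)            # 528e6, roughly 500 MFlops
full_machine_100mhz = peak_flops(100_000_000, 2048)  # 1.6384e12, about 1.6 TeraFlops
```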

APEmille is hosted by a cluster of networked
PCI-based PCs or workstations. Each host controls
a group of 32 memory-mapped nodes. The hosts are
connected by a dedicated high-speed low-latency
network based on serial Motorola "AutoBahn" links
or by any standard network interface. The close
integration of APEmille into a host network allows
a high I/O bandwidth and flexible re-configuration
of the machine and the peripherals.
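
The board and host counts implied by these numbers can be worked out directly. The Python sketch below assumes that a machine is always built from whole processing boards and that a partially filled group of 32 nodes still needs one host; both are assumptions of the example, and the function name is hypothetical.

```python
NODES_PER_BOARD = 8   # one processing board houses eight nodes
NODES_PER_HOST = 32   # each host controls 32 memory-mapped nodes


def machine_layout(nodes):
    """Return (boards, hosts) for a machine with the given number of nodes."""
    if nodes <= 0 or nodes % NODES_PER_BOARD:
        raise ValueError("node count must be a positive multiple of 8")
    boards = nodes // NODES_PER_BOARD
    hosts = -(-nodes // NODES_PER_HOST)  # ceiling division
    return boards, hosts


boards_per_host = NODES_PER_HOST // NODES_PER_BOARD
full_machine = machine_layout(2048)
smallest_machine = machine_layout(8)
```

Under these assumptions the full 2048-node machine consists of 256 boards served by 64 hosts, with four boards per host.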

3. Software and Simulation Environment 

The software developed for APEmille has three
main parts: the operating system (OS), the
compilation chain (TAOmille), and development and
simulation tools.

The OS is distributed on the hosts running a
freely available Linux kernel. The OS handles
the program loading, local and global data I/O,
system services, and the partitioning of the
machine. The OS has a layered and object-oriented
structure written in C++, and uses an innovative
client-server approach which supports the
creation and management of distributed objects.

The dynamic programming language TAOmille
supports the new features of the architecture
(new data types, local addressing, extended
communications) but maintains full compatibility
with the language of APE100. An important new
element of the compiler is a high-level
optimisation stage before the assembler code
generation. It eliminates, for example, common
subexpressions and performs optimisations for
efficient address calculations and memory access.
The modular structure of the compilation chain
allows future extensions to other programming
languages by changing only the uppermost level of
the compiler.

A freely configurable simulator written in C++
is available for detailed behavioural simulations
of all APEmille components. It is used for
automatic generation of test stimuli and reference
outputs for the electrical VHDL simulation, for
tests of the compilation chain and OS, and for
performance estimates of realistic application
codes.

4. Performance Considerations 

Tarzan as well as Jane use a very long
instruction word, in which the operations of all
independent devices are specified at each clock
cycle. This allows a highly efficient scheduling
and optimisation of the arithmetic and
memory-access pipelines by the low-level optimiser
in the last step of the compilation chain.

Various new controller instructions and their
improved timing generally allow shorter latencies
for address calculations and should alleviate the
related performance bottlenecks of APE100. The
local addressing and the more flexible
communications will be helpful for various
algorithmic problems, like preconditioning or FFT.

Maximal FP performance can be obtained in
APEmille by the use of complex normal operations
(8 Flop/cycle), for which the ratio between Flop
rate and memory bandwidth is as well-balanced as
in APE100. While for other operand types the
maximal Flop rate of Jane is smaller (1/2 for
VNORM, 1/4 for DNORM and SNORM), the
correspondingly increased ratio of bandwidth to
Flop rate can be helpful for applications that are
less balanced than LGT computations.
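
The balance between memory bandwidth and Flop rate can be made explicit by dividing the 64 bits transferred per clock cycle by the Flop per cycle of each operand type. The Python sketch below does this; the variable names are illustrative, and "bytes per Flop" is just one convenient way to express the ratio.

```python
BYTES_PER_CYCLE = 8  # one 64-bit or two 32-bit operands per clock cycle
FLOP_PER_CYCLE = {"CNORM": 8, "VNORM": 4, "SNORM": 2, "DNORM": 2}

# Memory traffic available per peak Flop for each operand type.
bytes_per_flop = {op: BYTES_PER_CYCLE / flop for op, flop in FLOP_PER_CYCLE.items()}

# Flop rate relative to complex normal operations (1/2 and 1/4 in the text).
relative_flop_rate = {op: flop / FLOP_PER_CYCLE["CNORM"] for op, flop in FLOP_PER_CYCLE.items()}
```

In this picture, a code using SNORM or DNORM operations has four times more memory bandwidth available per Flop than one using CNORM.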

Rough performance estimates with simple and
untuned assembler code for the Wilson-Dirac
operator yield 88% pipeline filling (760 clock
cycles) when using CNORM instructions,
corresponding to 60% of peak performance
(subtracting useless Flops from e.g. real ×
complex factors). This shows that no severe
limitations from the memory bandwidth are
encountered. For double precision a pipeline
filling of 82% (1150 clock cycles) has been found,
corresponding to 21% of single-precision peak
performance. This indicates that less than the
naive factor of 4 is lost with double-precision
arithmetic.
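
The relation between these percentages can be checked with a few lines of arithmetic. The Python sketch below is one possible reading of the quoted numbers, not a statement of how the estimates were actually obtained: it assumes that the double-precision figure is the pipeline filling scaled by the DNORM rate of 2 Flop per cycle relative to the 8 Flop per cycle of CNORM, and that the "factor lost" is the ratio of the two sustained fractions.

```python
CNORM_FLOP_PER_CYCLE = 8
DNORM_FLOP_PER_CYCLE = 2


def fraction_of_cnorm_peak(pipeline_filling, flop_per_cycle):
    """Sustained Flop rate as a fraction of the single-precision (CNORM) peak."""
    return pipeline_filling * flop_per_cycle / CNORM_FLOP_PER_CYCLE


double_fraction = fraction_of_cnorm_peak(0.82, DNORM_FLOP_PER_CYCLE)  # about 0.21
useful_share_single = 0.60 / 0.88  # share of issued CNORM Flops that is useful
loss_factor = 0.60 / 0.21          # single vs. double precision, below 4
```

Under this reading, roughly two thirds of the Flops issued by CNORM instructions are useful, and double precision costs a factor of about 2.9 rather than 4.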

5. Status Summary 

The main elements of the software are available
and tested on the system simulator: a prototype
of the distributed OS is running, and the low level
of the new compilation chain is used extensively
for the compilation of test programs.

The detailed electronic design of APEmille is
almost completed and some VLSI components are
ready to be produced. Extensive tests of the
overall design are currently being performed on
the simulator. A register-file test chip is the
first VLSI component available so far, and further
hardware components of the system will be tested
during the next months. We expect to have first
prototype systems with several APEmille boards
early in spring next year and a medium-size
machine by next summer.